World Applied Sciences Journal 2 (4): 323-332, 2007
ISSN 1818-4952
© IDOSI Publications, 2007
Corresponding Author: Dr. Sorana D. Bolboaca, “Iuliu Hațieganu” University of Medicine and Pharmacy, Cluj-Napoca, Romania
Data Mining on Structure-Activity/Property Relationships Models
Sorana D. Bolboaca (1) and Lorentz Jäntschi (2)
(1) “Iuliu Hațieganu” University of Medicine and Pharmacy, Cluj-Napoca, Romania
(2) Technical University of Cluj-Napoca, Romania
Abstract: The molecular descriptors family on structure-activity/property relationships approach was applied in order to identify the link between the structure of compounds and their activity or property. Fifty-five classes of properties or activities of different sets of compounds were investigated. Univariate and multivariate linear regression models using molecular descriptors as variables were identified. The models with estimation and prediction abilities, together with their associated characteristics, were stored in a database. A data mining analysis using classification and clustering was applied to the resulting database in order to search for and extract useful information. The methodology applied and the results obtained are presented.
Key words: Knowledge-Discovery in Databases (KDD); cluster analysis; Structure-Activity/Property Relationships (SAR/SPR); Molecular Descriptors Family (MDF)
INTRODUCTION
Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining and/or clustering. The term has been defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data [1], and is considered the science of extracting useful information from large data sets or databases [2].
Data mining techniques are used to search for consistent patterns and/or systematic relationships between variables in business [3], evaluation of web-based educational programs [4], computer science [5], chemistry [6], engineering [7], medicine [8] and in all domains where a large amount of data must be analyzed.
A new method of quantitative structure-activity/property relationships, abbreviated as MDF SAR/SPR (molecular descriptors family on the structure-activity/property relationships), was introduced by Jäntschi in 2004 [9] and reviewed in 2005 [10]. Since then, samples of compounds with different properties or activities have been investigated and analyzed. Some results on different properties (retention chromatography index [9], relative response factor [11], molar refraction [12], octanol/water partition coefficient [13-15]) or activities (insecticidal activity [16], herbicidal activity [17], antioxidant efficacy [18], inhibition activity [19-21], toxicity [22, 23], antituberculotic activity [24] and antimalarial activity [25]) have been reported. In addition, the overall results from the use of the molecular descriptors family on structure-property/activity relationships have also been published [26].
The best performing models in terms of correlation coefficients and cross-validation scores were collected into a database. Data mining techniques were applied to this body of information in order to identify consistent patterns and/or relationships between the variables of the MDF SAR/SPR models.

MATERIAL
Fifty-five sets of compounds were included in the analysis. The set abbreviation, the activity or property of interest and the class of compounds are presented in Table 1.
Univariate and multivariate models were obtained by applying the MDF SAR/SPR methodology to the samples of compounds; the models were stored in a database. The molecular descriptors are the variables used by the models.
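As a rough illustration of what one record in such a database could hold, the following is a minimal Python sketch, not the MDF SAR/SPR implementation. It fits a one-descriptor linear model to invented toy values, computes the squared correlation coefficient and a cross-validation score, and writes both into an SQLite table. Several things here are assumptions: the cross-validation scheme is taken to be leave-one-out, the score is taken to be the squared correlation between observed and predicted values, and the table name, column names, set name and data are all hypothetical.

```python
import sqlite3


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def squared_correlation(u, v):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suu = sum((ui - mean_u) ** 2 for ui in u)
    svv = sum((vi - mean_v) ** 2 for vi in v)
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    return suv ** 2 / (suu * svv)


def leave_one_out_predictions(x, y):
    """Predict each compound from a model fitted on all the others."""
    predictions = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(intercept + slope * x[i])
    return predictions


def store_model(conn, set_name, x, y):
    """Fit the model, score it, and save one row (hypothetical schema)."""
    intercept, slope = fit_line(x, y)
    estimated = [intercept + slope * xi for xi in x]
    r2 = squared_correlation(y, estimated)
    r2_cv = squared_correlation(y, leave_one_out_predictions(x, y))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS models ("
        "set_name TEXT, n_descriptors INTEGER, r2 REAL, r2_cv_loo REAL)"
    )
    conn.execute(
        "INSERT INTO models VALUES (?, ?, ?, ?)", (set_name, 1, r2, r2_cv)
    )
    conn.commit()
    return r2, r2_cv


# Invented toy data: one descriptor value and one measured property per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [1.1, 1.9, 3.2, 3.9, 5.1]

conn = sqlite3.connect(":memory:")
r2, r2_cv = store_model(conn, "toy_set", descriptor, observed)
print(f"r2 = {r2:.4f}, leave-one-out r2 = {r2_cv:.4f}")
```

Keeping the estimation score and the cross-validation score in the same row makes it possible to query or cluster models later by both their fit and their predictive ability.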